Exploiting Citation Networks in Large Corpora to Improve Relevance on Broad Queries
Marc-André Morissette • Location: Theater 7 • Haystack 2023
We at Lexum host and manage legal databases: largely unstructured text corpora composed of millions of documents, each several thousand words long.
In such corpora, broad search queries such as “eavesdropping” or “residential eviction enforcement” are difficult to rank. Thousands of documents discuss these topics at length, but which should the user see first? We assert that ranking by authority is intuitive and meets most users’ expectations.
We have created an algorithm that analyzes a corpus’s citation network and identifies the most cited documents in the context of the user’s query. Heavily cited documents are inferred to be more authoritative. This approach can even rescue relevant documents that were initially missed because they do not contain the query’s terms.
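The general idea can be illustrated with a minimal Python sketch. This is not the algorithm presented in the talk, whose math is not given in this abstract; it is a simplified stand-in that counts only citations coming from documents that matched the query. The function names, the `cites` mapping, and the toy document IDs are all hypothetical.

```python
from collections import Counter


def contextual_citation_counts(hits, cites):
    """Count, for each document, how many of the query's hits cite it.

    hits:  iterable of document IDs returned by an ordinary text search
    cites: dict mapping a document ID to the IDs it cites (assumed structure)
    """
    counts = Counter()
    for doc in set(hits):
        # De-duplicate so a document citing the same target twice counts once.
        for cited in set(cites.get(doc, ())):
            if cited != doc:  # ignore self-citations
                counts[cited] += 1
    return counts


def rank_by_authority(hits, cites):
    """Order candidates by in-context citation count, most cited first.

    Candidates include documents cited by the hits even when they were not
    hits themselves, which is how a document lacking the query terms can
    still surface.
    """
    counts = contextual_citation_counts(hits, cites)
    candidates = set(hits) | set(counts)
    # Ties are broken by document ID so the ordering is deterministic.
    return sorted(candidates, key=lambda d: (-counts[d], d))


# Toy citation network: A, B and C match the query; D and E do not.
cites = {
    "A": ["C", "D"],
    "B": ["C"],
    "C": ["D"],
    "E": ["A"],  # E is not a hit, so its citation of A is not counted
}
hits = ["A", "B", "C"]
ranking = rank_by_authority(hits, cites)
print(ranking)
```

In this toy network, D never matched the query but is cited by two of the hits, so it enters the ranking alongside C. A production system would presumably also weigh the text relevance of each citing document rather than rely on raw counts alone.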
We will present the math behind our algorithm, our Lucene/Solr implementation, and how we put the algorithm into production by merging traditional ranking methods with this new ranking approach.
Marc-André Morissette
Lexum
Marc-André Morissette currently serves as the Vice President of Technology at Lexum, a Software-as-a-Service (SaaS) company specializing in online legal information delivery products. In this role, he oversees Lexum's key technologies and its research and development initiatives. With a professional focus on natural language systems, Marc-André has developed expertise in search, information extraction, and automatic summarization. He is currently working on applying Large Language Models to the legal domain and investigating chain-of-thought prompting for search. Over the course of his career, Marc-André has led several projects, including designing a search analytics package for legal search engines, creating an abstractive summarization system using pre-trained transformers, improving search relevance on broad queries by leveraging citation networks, deploying distributed search technology, improving search user interfaces, and developing a fuzzy citation recognition system.